DATA1220-55, Fall 2024
2024-09-25
(General) Addition Rule
(General) Multiplication Rule
Dependence vs Independence
The probability of event A or event B occurring is the sum of the probability that A occurs and the probability that B occurs, minus the probability that A and B both occur.
\[ \begin{aligned} P(A \operatorname{or} B) &= P(A \cup B) \\ &= P(A) + P(B) - P(A \operatorname{and} B) \\ &= P(A) + P(B) - P(A \cap B) \end{aligned} \]
When events A and B are disjoint, the probability of event A or event B occurring is just the sum of the probability that A occurs and the probability that B occurs, because the probability that A and B both occur is 0.
\[ \begin{aligned} P(A \operatorname{or} B) &= P(A \cup B) \\ &= P(A) + P(B) - P(A \operatorname{and} B) \\ &= P(A) + P(B) - 0 \\ &= P(A) + P(B) \end{aligned} \]
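As a quick check of both forms of the addition rule, the sketch below enumerates the outcomes of one roll of a fair six-sided die. The die and the three events are illustrative assumptions, not one of the course examples.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}         # event A
greater_than_4 = {5, 6}  # event B, overlaps A at 6 (non-disjoint)
odd = {1, 3, 5}          # shares no outcomes with A (disjoint)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
general = prob(even) + prob(greater_than_4) - prob(even & greater_than_4)

# Disjoint case: P(A and B) = 0, so the rule reduces to P(A) + P(B)
disjoint = prob(even) + prob(odd)

# Each rule should agree with counting the outcomes in the union directly.
print(general, prob(even | greater_than_4))
print(disjoint, prob(even | odd))
```

Subtracting \(P(A \cap B)\) matters only in the non-disjoint case: the outcome 6 belongs to both events and would otherwise be counted twice.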
The probability of event A and event B occurring is the product of the probability that A occurs and the conditional probability that B occurs given that A has already occurred.
\[ \begin{aligned} P(A \operatorname{and} B) &= P(A \cap B) \\ &= P(A) \times P(B \operatorname{given} A) \\ &= P(A) \times P(B | A) \end{aligned} \]
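A small sketch of the general multiplication rule, assuming a hypothetical example of drawing two cards without replacement from a standard 52-card deck (not one of the course examples):

```python
from fractions import Fraction

# Hypothetical example: draw two cards without replacement from a 52-card deck.
p_first_ace = Fraction(4, 52)               # P(A): the first card is an ace
p_second_ace_given_first = Fraction(3, 51)  # P(B | A): 3 aces remain among 51 cards

# General multiplication rule: P(A and B) = P(A) * P(B | A)
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)  # 1/221
```

The second factor is a conditional probability because removing the first ace changes what is left in the deck.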
If random process B is independent of random process A, then the probability of random process B does NOT vary based on the outcome of random process A
i.e. knowing the outcome of A does NOT provide additional information about the probability of B
Example: When listening to a playlist using a “true shuffle”, the probability that the next song will be by a particular artist does not change based on whether or not the last song played was also by that artist.
When events A and B are independent, the probability of event A and event B occurring is the product of the probability that A occurs and the probability that B occurs, because the probability of B does not change based on the outcome of A.
\[ \begin{aligned} P(A \operatorname{and} B) &= P(A \cap B) \\ &= P(A) \times P(B \operatorname{given} A) \\ &= P(A) \times P(B | A) \\ &= P(A) \times P(B) \end{aligned} \]
Compare the conditional probabilities of B given the different possible outcomes of A. If \(P(B|A)\approx P(B)\) for all values of A, then the two random processes are likely independent.
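Calculate the probability that events A and B both occur under both an independence model (\(P(A \operatorname{and} B)=P(A)\times P(B)\)) and a dependence model (\(P(A \operatorname{and} B) = P(A) \times P(B|A)\)). If the two models give approximately the same probability, the two random processes are likely independent.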
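One way to carry out the first check is sketched below. The counts, the helper name `conditional_vs_marginal`, and the tolerance are all hypothetical choices made for illustration; how close counts as "approximately equal" is a judgment call.

```python
# Hypothetical counts for illustration only:
# rows are outcomes of A, columns are outcomes of B.
counts = {
    "A = yes": {"B = yes": 30, "B = no": 70},
    "A = no": {"B = yes": 60, "B = no": 140},
}

def conditional_vs_marginal(counts, b_outcome):
    """Return P(B) and a dict of P(B | A = a) for every outcome a of A."""
    total = sum(sum(row.values()) for row in counts.values())
    p_b = sum(row[b_outcome] for row in counts.values()) / total
    p_b_given_a = {a: row[b_outcome] / sum(row.values()) for a, row in counts.items()}
    return p_b, p_b_given_a

p_b, p_b_given_a = conditional_vs_marginal(counts, "B = yes")

# The tolerance for "approximately equal" is an arbitrary choice for this sketch.
TOLERANCE = 0.02
likely_independent = all(abs(p - p_b) <= TOLERANCE for p in p_b_given_a.values())

print(f"P(B) = {p_b:.3f}")
for a, p in p_b_given_a.items():
    print(f"P(B | {a}) = {p:.3f}")
print("Likely independent:", likely_independent)
```

Comparing the joint probabilities under the two models leads to the same conclusion, since both sides are multiplied by the same \(P(A)\).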
A Pew Research survey asked 2,373 randomly sampled registered voters about their…
Political affiliation (Democrat, Republican, Independent)
Whether they consider themselves a swing voter (Yes, No)
35% responded Independent, 23% identified as swing voters, and 11% identified as both
Are these events disjoint or non-disjoint?
What does the sample space look like?
What do the contingency tables look like?
What % of voters identify as an Independent or a swing voter?
What % of voters identify as neither an Independent nor a swing voter?
Are identifying as an Independent and identifying as a swing voter dependent or independent processes?
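One possible way to work through the numeric questions, using only the three proportions reported above and the general addition and multiplication rules. The variable names are illustrative.

```python
# Proportions reported in the survey summary above.
p_independent = 0.35
p_swing = 0.23
p_both = 0.11

# General addition rule: Independent or swing voter.
p_either = p_independent + p_swing - p_both

# Complement: neither Independent nor swing voter.
p_neither = 1 - p_either

# Conditional probability of being a swing voter given Independent.
p_swing_given_independent = p_both / p_independent

# Joint probability we would expect if the two processes were independent.
p_both_under_independence = p_independent * p_swing

print(f"Independent or swing voter: {p_either:.0%}")
print(f"Neither: {p_neither:.0%}")
print(f"P(swing | Independent) = {p_swing_given_independent:.3f} vs P(swing) = {p_swing:.3f}")
print(f"P(both) observed = {p_both:.3f} vs under independence = {p_both_under_independence:.4f}")
```

Because 11% of voters fall in both groups, the events are non-disjoint. About 47% are Independent or swing voters and about 53% are neither. The observed joint probability (0.11) is noticeably larger than the roughly 0.08 expected under independence, which suggests the two processes are dependent.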
The American Community Survey (ACS) provides public data each year to give communities demographic information to plan investments and services. The 2010 ACS estimates that…
14.6% of Americans live below the poverty line
20.7% of Americans speak a language other than English at home
31.1% of Americans live below the poverty line or speak a language other than English at home
Are these events disjoint or non-disjoint?
What does the sample space look like?
What do the contingency tables look like?
What % of Americans live below the poverty line and speak a language other than English at home?
What % of Americans live below the poverty line and speak only English at home?
Are living below the poverty line and speaking a language other than English at home dependent or independent?
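A sketch of one way to answer the numeric questions by rearranging the general addition rule to solve for the joint probability. It uses only the three ACS estimates above; the variable names are illustrative.

```python
# Estimates reported above (2010 ACS).
p_poverty = 0.146
p_other_language = 0.207
p_either = 0.311

# Rearranged addition rule: P(A and B) = P(A) + P(B) - P(A or B)
p_both = p_poverty + p_other_language - p_either

# Below the poverty line and English only = below the poverty line, minus the overlap.
p_poverty_english_only = p_poverty - p_both

# Independence check: compare P(B | A) with P(B), and the two joint probabilities.
p_language_given_poverty = p_both / p_poverty
p_both_under_independence = p_poverty * p_other_language

print(f"Poverty and other language: {p_both:.1%}")
print(f"Poverty and English only: {p_poverty_english_only:.1%}")
print(f"P(other language | poverty) = {p_language_given_poverty:.3f} vs {p_other_language:.3f}")
print(f"P(both) = {p_both:.3f} vs under independence = {p_both_under_independence:.3f}")
```

The overlap works out to about 4.2%, so the events are non-disjoint, and about 10.4% live below the poverty line and speak only English at home. The conditional probability (about 0.29) differs from the marginal 0.207, which suggests the two are dependent.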
We will only be covering Chapter 4.1 on the normal distribution in your textbook
If you have an interest in math or statistics, you may want to read the rest of Chapter 4
4.2 - Geometric distribution
4.3 - Binomial distribution
4.4 - Negative binomial distribution
4.5 - Poisson distribution
Identify and describe the standard normal and normal distributions
Standardize normal distributions and calculate Z-scores
Calculate percentiles and exact probabilities
Apply the 68-95-99.7 Rule
Read a QQ-Plot (not in book)
Symmetric, unimodal, “bell-shaped”
Not as common as people think in real data
Strong assumption in small sample sizes
Powerful statistical tests available when outcome approximates normal distribution
\(\mu\) (Greek letter mu) represents the mean
\(\sigma\) (Greek letter sigma) represents the standard deviation
\(N(\mu, \sigma)\) stands for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\)
Vocabulary scores for 947 seventh-graders. Both histograms and density curves can be helpful in identifying normal distributions.
Dashed line is self-reported heights by females on OkCupid. Dark purple line is the normal distribution with the same mean and standard deviation. Light purple line is the US average.
Changing the mean shifts the “center” of the distribution. Changing the standard deviation alters the “width” of the distribution (i.e. variability).
A Z-score is the number of standard deviations a value falls above (when positive) or below (when negative) the mean of the data
Z-scores standardize a normal distribution by…
Centering the data at 0 by subtracting the mean from each score
Scaling the units of the data to 1 by dividing the centered data by the standard deviation
\[ \begin{aligned} Z &= \frac{\text{observed value}-\text{mean}}{\text{standard deviation}} \\ &= \frac{x-\mu}{\sigma} \end{aligned} \]
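The two standardizing steps (center, then scale) can be sketched as follows. The data values are made up for illustration, and using the population standard deviation (`pstdev`) is an assumption of this sketch.

```python
from statistics import mean, pstdev

# Hypothetical data values for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)       # mean of the data
sigma = pstdev(values)  # population standard deviation (an assumption here)

# Z = (x - mu) / sigma: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

After standardizing, the Z-scores have mean 0 and standard deviation 1, so a positive Z-score marks a value above the mean and a negative one a value below it.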
SAT scores are normally distributed with \(\mu=1500\) and \(\sigma=300\) (\(N(\mu=1500, \sigma = 300)\))
ACT scores are normally distributed with \(\mu=21\) and \(\sigma=5\) (\(N(21, 5)\))
How do we compare normal distributions with different locations and scales?
Standardizing the data by converting values to Z-scores puts different distributions on the same scale.
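To illustrate, the sketch below compares two hypothetical test-takers using the SAT and ACT distributions given above. The scores 1800 and 24 are made-up values, not from the course materials.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x falls above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical scores for illustration only.
sat_score = 1800
act_score = 24

z_sat = z_score(sat_score, mu=1500, sigma=300)  # SAT ~ N(1500, 300)
z_act = z_score(act_score, mu=21, sigma=5)      # ACT ~ N(21, 5)

print(f"SAT Z-score: {z_sat:.1f}")
print(f"ACT Z-score: {z_act:.1f}")
```

Here the SAT score sits 1.0 standard deviations above its mean and the ACT score 0.6 above its mean, so the SAT score is the stronger result relative to its own distribution, even though the raw scores cannot be compared directly.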
The standard normal distribution is a normal distribution with \(\mu=0\) and \(\sigma=1\) (written \(N(\mu=0, \sigma=1)\))
Units of the standard normal distribution are standard deviations (Z-scores) (i.e. 1 unit = 1 SD)
Observations that are 2+ standard deviations from the mean are considered unusual
When data is (nearly) normally distributed…
~68% of the observations are within 1 standard deviation of the mean (\(\mu \pm \sigma\))
~95% of the observations are within 2 standard deviations of the mean (\(\mu \pm 2\sigma\))
~99.7% of the observations are within 3 standard deviations of the mean (\(\mu \pm 3\sigma\))
The 68-95-99.7 Rule describes approximately what proportion of the observations should lie within 1, 2, and 3 standard deviations of the mean respectively, if the data is normally distributed
SAT scores have the distribution \(N(1500, 300)\)
~68% of scores will be 1200-1800
~95% of scores will be 900-2100
~99.7% of scores will be 600-2400
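The rule can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch for verification, not how the course computes these values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

# Matching score ranges for SAT scores distributed N(1500, 300).
sat_mu, sat_sigma = 1500, 300
sat_ranges = {k: (sat_mu - k * sat_sigma, sat_mu + k * sat_sigma) for k in (1, 2, 3)}

for k in (1, 2, 3):
    low, high = sat_ranges[k]
    print(f"within {k} SD: {shares[k]:.1%} of scores, SAT range {low}-{high}")
```

The exact shares are slightly different from the rounded 68, 95, and 99.7, which is why the rule is stated as approximate.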
A percentile is the proportion or percentage of observations that fall below a given value in a distribution.
You can use a Z-Score Table to look up the percentile that corresponds to a particular Z-Score.
You can also use a Z-Score Table to look up the probability of observing a value below a particular Z-Score (or above it, by subtracting from 1).
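A software equivalent of the table lookup, sketched with `statistics.NormalDist` from the Python standard library. The Z-score of 1.0 is a hypothetical value (for example, an SAT score of 1800 under \(N(1500, 300)\)).

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

z = 1.0  # hypothetical Z-score for illustration

# Percentile: proportion of observations that fall below z, i.e. P(Z < z).
percentile = standard_normal.cdf(z)

# Upper tail: proportion of observations that fall above z.
upper_tail = 1 - percentile

# Reverse lookup: the Z-score that sits at the 90th percentile.
z_for_90th = standard_normal.inv_cdf(0.90)

print(f"P(Z < {z}) = {percentile:.4f}")
print(f"P(Z > {z}) = {upper_tail:.4f}")
print(f"Z at the 90th percentile = {z_for_90th:.4f}")
```

A Z-score of 1.0 sits at roughly the 84th percentile, and the 90th percentile corresponds to a Z-score of about 1.28.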
Sometimes the normal distribution is an acceptable approximation of a discrete numeric variable, but other distributions may be more appropriate.
DATA1220-55 Fall 2024, Class 12 | Updated: 2024-09-25